Protein Science

Wiley

Preprints posted in the last 90 days, ranked by how well they match Protein Science's content profile, based on 221 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit.

1
GEF me a break: the consequences of freezing Rho guanine-nucleotide exchange factor catalytic domains

Anderson, L. K.; Barpal, E.; Mendoza, H.; Cash, J. N.

2026-04-09 biochemistry 10.64898/2026.04.08.717323 medRxiv
Top 0.1%
22.5%

Purified proteins are routinely flash frozen for use in functional and structural studies, providing a convenient way to reproduce results across complex experiments. Rho guanine-nucleotide exchange factors (RhoGEFs) are no exception to this practice, yet the effects of freezing on their activity and stability remain largely uncharacterized. This gap potentially affects the characterization of these important enzymes and how results are interpreted with respect to their prospective use as therapeutic targets. Here, we tested the isolated DH/PH tandems of P-Rex1, P-Rex2, and PRG under different cryoprotectant conditions and monitored activity and thermostability over time after flash freezing. Our results show a clear divergence between the activity of fresh and frozen purified RhoGEF protein samples in as little as one week for some conditions. Specifically, the variability in data collected on frozen samples was greatly increased. Despite these differences, thermostability seems to be preserved for much longer timepoints across RhoGEFs. Moreover, despite eventual changes in both activity and thermostability with respect to freezing, there are no obvious changes in global conformation between fresh and frozen samples of the isolated P-Rex2 DH/PH tandem. From our data, there are few generalizable trends between the different RhoGEFs and no single cryoprotective agent tested was a silver bullet to preserve both activity and thermostability across RhoGEFs. Overall, our findings emphasize the unpredictable effects of freezing RhoGEFs. As such, RhoGEF freezing should be carefully characterized for each protein and critically viewed when comparing analyses between different studies.

2
Structural divergence in N-terminal domains of AAA proteases paraplegin (SPG7) and FtsH indicates a key structural function in complex formation

Hyatt, J. G.; Paterson, N. G.; Devos, J. M.; Oliveira, C. L. P.; Prevost, S.; Jessen, C. M.; Hoffman, A.; Pedersen, J. S.; Winter, A.

2026-04-24 biochemistry 10.64898/2026.04.22.720153 medRxiv
Top 0.1%
22.3%

AAA proteases are hexameric ATP-dependent metallopeptidases that perform crucial proteolytic activities within prokaryotic and eukaryotic membranes. Structurally, protomers comprise catalytically active C-terminal domains that are anchored to the membrane by an N-terminal autonomous folding unit. In this study, we determined the fold, stability, and oligomeric state of the N-terminal intermembrane domains of human spastic paraplegia type 7 (SPG7)/paraplegin protein and its bacterial orthologue FtsH using circular dichroism (CD), small-angle X-ray scattering (SAXS), small-angle neutron scattering (SANS) and X-ray crystallography. Solution-state analysis revealed that the N-terminal domain of paraplegin is a monomer in solution whereas FtsH forms a dimer. Unexpectedly, the N-terminal domain of paraplegin presents as a domain-swapped homodimer in our crystal structure that involves the first helix and first two beta-strands from one monomer and beta-strand 3, helix 2 and beta-strand 4 from another symmetry-related molecule. However, together they form an assembly which is similar to protomers observed for the N-terminal regions of FtsH and AFG3L2. Drawing from our structural data, we postulate that domain-swapping interactions of the N-terminal regions contribute to stability of the AAA protease hexamer containing paraplegin, demonstrating the extensive flexibility of the N-terminal portion of this protein and its role in achieving the appropriate molecular architecture required for function.
Highlights:
- FtsH-IMS forms a homodimer in solution, whereas paraplegin-IMS presents as a well-folded monomer in solution
- Paraplegin-IMS crystallises as a domain-swapped homodimer, but its domain-swapped monomers are structurally similar to other IMS regions
- AFG3L2/paraplegin hexamer formation may be supported by domain swapping in paraplegin-IMS
- Domain swapping in paraplegin could be a bona fide feature under certain cellular conditions and may be related to disease in spastic paraplegia

3
Global analysis of thermal and chemical denaturation using CheMelt: Thermodynamic dissection of highly thermostable de novo designed proteins

Lampinen, V.; Burastero, O.; Guazzelli, I. P.; Vogele, F.; Pinheiro, F.; Nowak, J. S.; Garcia Alai, M. M.; Kjaergaard, M.

2026-04-09 biophysics 10.64898/2026.04.07.716910 medRxiv
Top 0.1%
21.7%

De novo protein design often produces thermostable proteins that denature above 100 °C, which complicates the analysis of their stability. Thermostable proteins can be unfolded by combined chemical and thermal denaturation followed by global analysis of multiple melting curves. Here, we have developed CheMelt, a new online tool for global analysis of unfolding data via an intuitive graphical user interface. We use nanoscale differential scanning fluorimetry followed by CheMelt data analysis to dissect the combined thermal and chemical denaturation of thirty-five de novo designed protein binders. Fifteen present sufficient fluorescence changes to extract thermodynamic parameters of unfolding. These de novo designed proteins have systematically lower ΔCp and m-values than comparable natural proteins, which implies that they expose fewer hydrophobic residues upon unfolding. We show that a high thermostability of a designed protein does not necessarily imply a high equilibrium stability; and demonstrate the potential of CheMelt in dissecting thermodynamic properties for protein design and engineering.

4
Structural basis for saccharide binding by human RNase 2/EDN, a protein combining enzymatic and lectin properties

Kang, X.; Prats-Ejarque, G.; Boix, E.; Li, J.

2026-03-23 biochemistry 10.64898/2026.03.20.713198 medRxiv
Top 0.1%
19.0%

Human RNase 2 (eosinophil-derived neurotoxin, EDN) is a major eosinophil granule protein of the vertebrate-specific RNase A superfamily and is involved in antiviral response and inflammation. Identifying ligand-binding pockets in EDN is thus relevant to structure-based drug design. In our laboratory we identified by protein crystallography a conserved site at the protein surface that binds carboxylic anion molecules (malonate, tartrate and citrate). Searching for potential biomolecules rich in anion groups and considering previous reports of EDN binding to glycosaminoglycans, we explored the protein's binding to saccharides. Next, EDN crystals were soaked with mono- and disaccharides, and the 3D structures of ten complexes were solved by X-ray crystallography at atomic resolution. We identified protein binding pockets for glucose, fucose, mannose, sucrose, galactose, trehalose, N-acetyl-D-glucosamine, N-acetylmuramic acid, and the sialic acid N-acetylneuraminic acid. A main site for glucose, fucose, and galactose was located adjacent to the spotted carboxylic anion site. Secondarily, N-acetylneuraminic acid, N-acetylmuramic acid, sucrose, galactose, and mannose shared another protein surface region. Overall, the saccharides clustered into seven defined sites, outlining a conserved recognition pattern, which was further analysed by molecular modelling. Interestingly, within the RNase A family, we find amphibian RNases that were initially isolated as carbohydrate-binding proteins and named leczymes, combining enzymatic and lectin properties. The present data are the first systematic structural characterization of a mammalian sugar-binding RNase within the family. The results highlight unique EDN residues that mediate its sugar-specific interactions, of particular interest for a better understanding of the protein's physiological role.
Highlights:
- Structure of RNase 2 in complex with mono- and disaccharides at atomic resolution
- Identification of sugar-binding sites unique to RNase 2
- Characterization of a mammalian RNase A family enzyme with lectin properties

5
Systematic Characterization of Thermal Stability Assay Parameters and Application in Discovery of Peptide-Protein Interactions

Richards, D. M.; Zhai, F.; Li, S.; Yu, Q.

2026-05-08 biochemistry 10.64898/2026.05.06.723354 medRxiv
Top 0.1%
17.8%

Thermal proteome profiling (TPP) and its higher-throughput derivative, the proteome integral solubility alteration (PISA) assay, measure changes in protein thermal stability upon ligand binding or other perturbations and have been widely adopted in drug discovery and biomedical research. Though the PISA workflow is straightforward, key parameters, including detergent concentration, methods for removing denatured aggregates, and temperature range selection, vary across studies and can markedly influence assay outcomes. Yet these factors have not been systematically evaluated, limiting rational experimental design and data interpretation. Here, through a combined use of TPP, PISA, tandem mass tag (TMT)-based multiplexing, and computational simulation, we systematically characterize these parameters based on the melting behavior of ~9,000 proteins. We find that reducing detergent concentration elevates apparent Tm by 1.5-2 °C proteome-wide, and aggregate removal by filtration versus centrifugation further alters measurements. We leverage these observations to optimize PISA and then apply the optimized conditions to identify the aminopeptidase NPEPPS as a previously uncharacterized binding partner of angiotensin II, a key vasoactive peptide hormone in blood pressure regulation. Together, this work provides a general framework for assay design and data interpretation, and extends the utility of PISA beyond small molecules to dissecting peptide-protein interactions, an increasingly important modality in drug discovery.

6
Rate limiting release of product underlies concave Arrhenius break point of thermolysin with a Phe-Leu-Ala substrate

Miller, J. J.; Bahnson, B. J.

2026-04-23 biochemistry 10.64898/2026.04.22.720203 medRxiv
Top 0.1%
15.1%

Thermolysin, a bacterial zinc metalloprotease, has previously been reported to exhibit a biphasic kinetic temperature dependence of kcat with a characteristic convex shape. This convex shape is observed for almost all enzymes that display an Arrhenius break; fumarase is the exception, with a concave shape. Here, thermolysin kinetics measured with the tripeptide substrate N-[3-(2-furyl)acryloyl]-Phe-Leu-Ala (FAFLA) resulted in a concave Arrhenius plot, characterized by a 30 kJ/mol increase in enthalpy and entropy of activation, in contrast to the typical 30 kJ/mol decrease. Although the shape of the Arrhenius break differs, ionic strength and macromolecular crowding both attenuate the energetic magnitude of the break point, consistent with prior work. It was hypothesized that a different step of the catalytic cycle of thermolysin was represented by kcat with FAFLA to give rise to this new behavior. A 91% dependence of kcat on viscosity and modest solvent isotope effects, both distinct from previously characterized substrates, indicated that a physical step was responsible for the observed Arrhenius concavity. Hinge-bending conformational changes of thermolysin, monitored using the phosphoramidon inhibitor (a FAFLA mimic), exhibited a fully linear temperature dependence, excluding these large-scale motions as the origin of concavity. It was therefore proposed that release of the N-[3-(2-furyl)acryloyl]-Phe product is likely rate limiting since release was proposed to involve a two-step pathway to free the product coordinated to the catalytic Zn2+ of thermolysin. These findings provide a mechanistic framework for seldom-seen concave break point behavior and insights into the contribution of dynamics of physical processes to catalysis.
Importance and impact: Enzymes that display Arrhenius break behavior provide insight into how dynamics impact catalysis. Almost every such enzyme studied thus far displays a convex biphasic shape, with concave shapes often not acknowledged. Thermolysin, which previously only showed a convex shape, displayed concave behavior with a tripeptide substrate. By linking this unusual kinetic behavior to a physical, not chemical, process, this work highlights the possible origin of a rare phenomenon which can expand understanding of protein dynamics and biphasic Arrhenius behavior.

7
Artificial intelligence aided design of peptides with custom secondary structure motifs and reduced amino acid alphabets

Brown, S. M.; Cohen, A. B.; Dean, S. N.

2026-05-01 bioinformatics 10.64898/2026.04.29.721096 medRxiv
Top 0.1%
14.9%

Proteins are highly diverse functional polymers where the specific sequence of amino acids, selected from a standard genetically encoded alphabet of twenty (C20), determines the structure and ultimately the function of the resulting folded protein. This standard alphabet has been identified to be non-randomly distributed in physicochemical properties crucial to both structure formation and function, an observation often referred to as coverage theory. While machine learning models have drastically improved protein structure prediction, protein design has yet to see a similar advance. Here we therefore bridge contemporary biological theory with recent advancements in artificial intelligence (AI) to develop and evaluate a generative AI protein design model, trained on hundreds of thousands of proteins within the RCSB PDB, for custom secondary structure motifs using reduced amino acid alphabets. Results indicate an overall success in designing novel proteins with desired secondary structure motifs for a broad range of amino acid alphabets. Interestingly, this tool often captures the full three-dimensional tertiary structure of a target protein despite training only on physicochemical sequence space and DSSP secondary structure. The development of this model advances research across multiple disciplines, from general scientific AI/ML architecture development to protein design for biotechnology, astrobiology, and early-Earth evolutionary biology.

8
Accurate protein stability prediction for small domains using mega-scale experiments

Cho, Y.; Tsuboyama, K.; Litberg, T. J.; Jung, M. D.; Obisesan, A.; Wang, Q.; Phoumyvong, C. M.; Thibeault, J.; Ovchinnikov, S.; Rocklin, G. J.

2026-05-20 biophysics 10.64898/2026.05.19.726285 medRxiv
Top 0.1%
14.0%

Predicting absolute protein folding stability is a long-standing challenge in biophysics, with broad applications in protein design and in understanding genetic variation and evolution. Physics-based simulations have shown limited success at predicting stability and are often computationally intractable, and machine learning methods have been constrained by the lack of sufficiently large experimental datasets. We recently introduced cDNA display proteolysis, a cell-free approach that can measure folding stability for nearly one million protein domains in parallel. Here, we applied this method to measure stability for 1.8 million diverse protein domains 60-80 amino acids in length primarily taken from the MGnify metagenomic database and spanning over 200,000 sequence families. Using this new "MGnify Stability dataset", we developed the predictive models SaProtΔG and ESM3ΔG, which accurately predict absolute folding stability for small domains with root mean squared error of 0.8 kcal/mol over a 6 kcal/mol range (Spearman rank correlation of 0.88). These predictors show high accuracy at predicting effects of substitutions, insertions, and deletions, successfully identify global trends toward higher stability in thermophilic organisms, and improve discrimination of stable and unstable computationally designed proteins. Our results illustrate how megascale biophysical measurements can complement existing evolutionary and structural data to enable accurate absolute stability prediction for small domains.

9
Multi-objective Engineering of Trimethylamine Monooxygenase for Improved Thermostability and Cofactor Use

Xiang, R.; Floor, M.; Ree, R.; Canellas-Sole, A.; Puntervoll, P.; Roda, S.; Elin Kjaereng Bjerga, G.; Guallar, V.

2026-04-12 molecular biology 10.64898/2026.04.10.717641 medRxiv
Top 0.1%
12.5%

Trimethylamine (TMA) is a major contributor to undesirable odours in protein hydrolysates derived from marine by-products, limiting their industrial use. Flavin-containing monooxygenases (FMOs) catalyse the conversion of TMA to the odourless trimethylamine N-oxide (TMAO); however, industrial applications demand enzymes that are both thermally stable and compatible with cost-effective cofactors. A thermostable variant of the Methylophaga aminisulfidivorans FMO (mFMO_20) can function at elevated temperatures but depends exclusively on the expensive and unstable cofactor NADPH. In this study, we investigated whether it is possible to simultaneously enhance thermostability and NADH compatibility using a multi-objective engineering strategy. We first targeted residues in the cofactor binding site of mFMO_20 to restore NADH activity, which had been completely lost despite the wild-type enzyme being naturally active with both cofactors. Variants derived from the thermostable scaffold partially recovered NADH activity but showed reduced NADPH activity. Given the wild type's inherent NADH compatibility, we next pursued a stability-improvement approach, introducing highly conserved stabilizing mutations. This preserved cofactor competence but produced only modest improvements in thermostability. Finally, by combining physical, evolutionary, and statistical metrics, we obtained variants that retained higher NADPH activity after heat treatment than any previously reported thermostable mutants, while a subset also retained measurable NADH activity before heat treatment. These findings show that combining complementary scoring strategies helps navigate the trade-off between stability and activity. Although robust NADH function under thermal stress remains elusive, with only one variant retaining detectable NADH activity after heat treatment, the results provide valuable insight into the underlying constraints linking stability and cofactor usage and highlight possible directions for engineering FMOs with both enhanced thermostability and cofactor compatibility.
Author summary: In this work, we aimed to improve an enzyme that could be useful in industrial applications but is limited by two common constraints: poor stability at high temperatures and dependence on an expensive cofactor. To make the enzyme more suitable for large-scale applications, we sought to engineer variants that are both more thermostable and compatible with a cheaper cofactor, NADH. For enzyme engineering, we used a strategy that balances several properties rather than prioritizing a single trait. We combined tools that capture evolutionary patterns, protein physics, and AI-based predictions to explore which mutations might provide the right combination of stability and function. Through this approach, we obtained variants with improved heat resistance and higher cofactor activity retention.

10
Structure of human aldehyde oxidase under tris(2-carboxyethyl)phosphine-reducing conditions

Videira, C.; Esmaeeli, M.; Leimkuhler, S.; Romao, M. J.; Mota, C.

2026-03-25 biochemistry 10.64898/2026.03.25.713928 medRxiv
Top 0.1%
12.3%

The importance of human aldehyde oxidase (hAOX1) has increased over the last decades due to its involvement in drug metabolism. Inhibition studies concerning hAOX1 are extensive and a common reducing agent, dithiothreitol (DTT), was recently found to inactivate the enzyme. However, in previous crystallographic studies of hAOX1, DTT was found to be essential for crystallization. To address this concern, another reducing agent was used in crystallization trials. Using tris(2-carboxyethyl)phosphine (TCEP), a sulphur-free reducing agent, it was possible to obtain well-ordered crystals of wild-type hAOX1 and the variant hAOX1_6A, which diffracted beyond 2.3 Å. Instead of the typical star-shaped crystals of hAOX1, at pH 4.7, plates are obtained in the orthorhombic space group (P22121) with two molecules in the asymmetric unit. Activity assays with the enzyme incubated with both reducing agents show that, in contrast to DTT, TCEP does not lead to irreversible inactivation of the enzyme. The replacement of DTT with TCEP in crystallization of hAOX1 provides a strategy to circumvent enzyme inactivation during crystallographic studies, allowing future applications of new assays, such as time-resolved crystallography.

11
Do AI Models for Protein Structure Prediction Get Electrostatics Right?

Makhatadze, G. I.

2026-03-13 biophysics 10.64898/2026.03.11.711144 medRxiv
Top 0.1%
12.1%

A variant of the U1A protein containing four substitutions to ionizable residues was generated serendipitously due to a miscommunication. Biophysical measurements show that this variant has at least twice as much helical structure as the wild-type U1A and is trimeric in solution, in contrast to the monomeric wild type. In sharp contrast, structures predicted by deep-learning AI tools (AlphaFold2 and RoseTTAFold2) and transformer-based tools (OmegaFold and ESMFold) are all highly similar to the wild-type U1A (backbone RMSD < 1 Å). Even more surprising, two of the substituted ionizable residues are predicted to be fully buried in the non-polar core of the protein, an outcome that contradicts well-established physico-chemical principles, as ionizable residues are normally located on the protein surface. To explore this effect further, we generated sequences containing up to all twelve residues that make up the non-polar core of U1A. Across thousands of sequences, and depending on the AI model used, the majority of predicted structures contained fully buried ionizable residues while still maintaining the overall U1A fold. We then examined two additional proteins of comparable size, acylphosphatase and the de novo-designed TOP7 fold, and observed the same phenomenon: AI models frequently predicted structures with buried ionizable residues that nevertheless retained the parent fold. When these AI-predicted structures were subjected to short (50 ns) molecular dynamics simulations using physics-based force fields such as CHARMM or AMBER, the structures rapidly relaxed into ensembles that exposed ionizable residues. We conclude that while AI-based structure prediction tools perform extremely well on naturally occurring sequences, they do not reliably encode the physico-chemical principles governing the placement of ionizable residues. A straightforward remedy is to include a brief molecular dynamics simulation as a final validation step for AI-generated structures.

12
IDPForge: Deep Learning of Proteins with Global and Local Regions of Disorder

De Castro, S.; Zhang, O.; Liu, Z. H.; Forman-Kay, J. D.; Head-Gordon, T.

2026-03-27 biophysics 10.64898/2026.03.25.714313 medRxiv
Top 0.1%
10.4%

Although machine learning has transformed protein structure prediction of folded protein ground states with remarkable accuracy, intrinsically disordered proteins and regions (IDPs/IDRs) are defined by diverse and dynamical structural ensembles that are predicted with low confidence by algorithms such as AlphaFold and RoseTTAFold. We present a new machine learning method, IDPForge (Intrinsically Disordered Protein, FOlded and disordered Region GEnerator), that exploits a transformer protein language diffusion model to create all-atom IDP ensembles and IDR disordered ensembles that maintain the folded domains. IDPForge requires no sequence-specific training, back transformations from coarse-grained representations, or ensemble reweighting, as in general the created IDP/IDR conformational ensembles show good agreement with solution experimental data, and options for biasing with experimental restraints are provided if desired. We envision that IDPForge with these diverse capabilities will facilitate integrative and structural studies for proteins that contain intrinsic disorder. It is available as an open-source resource for general use.

13
A conserved isoleucine gates the diffusion of small ligands to the active site of NiFe CO-dehydrogenase

Opdam, L.; Meneghello, M.; Guendon, C.; Chargelegue, J.; Fasano, A.; Jacq-Bailly, A.; Leger, C.; Fourmond, V.

2026-03-21 biochemistry 10.64898/2026.03.19.713016 medRxiv
Top 0.1%
10.2%

CO dehydrogenases (CODH) are metalloenzymes that reversibly oxidize CO to CO2 at a buried NiFe4S4 active site. The substrates, CO and CO2, therefore need to be transported through the protein matrix to reach the active site. The most likely pathway for intra-protein diffusion is the hydrophobic channel identified in the crystal structures. Here, we use site-directed mutagenesis to study the highly conserved isoleucine 563 of Thermococcus sp. AM4 CODH2. Mutations at this position change the biochemical properties (KM for CO, product inhibition constant, catalytic bias...) and increase the resistance of the enzyme to the inhibitor O2, showing that isoleucine 563 indeed lines the gas channel. The I563F mutation decreases the bimolecular rate constant of inhibition by O2 15-fold, and increases the IC50 20-fold, which is the strongest improvement in O2 resistance reported so far. We show that the size of the introduced amino acids is less important than their flexibility, along with the size of the cavity formed near the active site in the channel. We also conclude that O2 access to the active site cannot be slowed down without also affecting CO diffusion. This tradeoff will have to be considered in further attempts to use site-directed mutagenesis to make CODHs more O2 tolerant.

14
CROWN: Curated Repository Of Well-resolved Noncovalent interactions

Poelmans, R.; Van Eynde, W.; Bruncsics, B.; Bruncsics, B.; Arany, A.; Moreau, Y.; Voet, A. R.

2026-04-01 bioinformatics 10.64898/2026.03.30.714168 medRxiv
Top 0.1%
10.1%

The development of machine learning models for protein-ligand interactions is fundamentally constrained by the quality and diversity of available structural data. Existing databases of protein-ligand complexes present researchers with an unsatisfying trade-off: carefully curated collections such as PDBBind and HiQBind offer high structural reliability but cover only a narrow slice of the Protein Data Bank (PDB), while large-scale resources like PLInder provide broad coverage at the expense of rigorous quality control. Here, we introduce CROWN (Curated Repository Of Well-resolved Non-covalent interactions), a machine learning-ready dataset that reconciles this tension by applying a comprehensive, fully automated preprocessing pipeline to the PLInder database. Starting from 649,915 protein-ligand interaction systems, CROWN applies a series of interleaved quality filters and processing stages addressing crystallographic resolution, ligand identity, pocket completeness, structural repair, interaction quality, and protonation at physiological pH. A distinguishing feature of the pipeline is a final constrained energy minimisation step using custom flat-bottomed restraints, which balances crystallographic evidence with relaxation of intramolecular strain. This step, absent from existing protein-ligand datasets, produces structurally uniform complexes by reconciling the heterogeneous refinement practices of different crystallographers and structure determination protocols, without distorting the experimentally observed binding geometry. The resulting dataset of 153,005 complexes represents a roughly four-fold increase in protein and species diversity over PDBBind and HiQBind, while maintaining rigorous structural standards.
Importantly, CROWN adopts a geometry-centric design philosophy that treats the 3D arrangement of atoms at the binding interface as a self-consistent source of information, rather than relying on externally measured binding affinities that cover only a fraction of known structures and introduce well-documented biases. We anticipate that CROWN will serve as a broadly useful resource for training generative models of protein-ligand binding poses, developing scoring functions, and benchmarking interaction prediction methods.

15
An Energy Landscape Approach to Miniaturizing Enzymes using Protein Language Model Embeddings

Lala, J.; Agrawal, H.; Dong, F.; Wells, J.; Angioletti-Uberti, S.

2026-03-05 bioinformatics 10.64898/2026.03.04.709378 medRxiv
Top 0.1%
10.1%

We present a general approach to find amino acid sequences corresponding to the most compact enzyme likely to retain the structure of a given catalytic site. Our approach is based on using Monte Carlo (MC) simulations to sample an energy landscape where minima correspond, by construction, to sequences with the aforementioned properties. Building on previous work (Wu et al., 2025) and with the BAGEL package (Lala et al., 2025), we implement a route to achieve this goal using only the information extracted from a protein language model (PLM), without structural information. After generating a set of candidate sequences with this PLM-guided BAGEL optimization, we further filter potential candidates for downstream experimental validation using a two-stage protocol. First, deep-learning-based structure prediction models (ESMFold, Chai-1, Boltz-2) are used to identify a structural consensus among designs with highly conserved active-site geometries, yielding many candidates with active-site RMSD below a few angstroms relative to the wild-type and pLDDT scores above 80. Second, molecular dynamics simulations are performed on a filtered subset of sequences (based on active-site RMSD and SolubleMPNN log-likelihoods) to evaluate active-site stability when including thermal fluctuations. For the most promising enzymes, these yield RMSF values in the active site below 1.0 Å and an active-site RMSD drift between 0.5 and 1.5 Å, making these mini-variants comparable to the wild type, though outcomes vary across enzymes. Given the protocol's generality, we believe these results represent a step forward in AI-guided enzyme design. To facilitate rapid experimental validation by the broader community, we open-source all sequences generated by our computational pipeline. These include designs for four representative enzymes of this study: PETase, subtilisin Carlsberg (serine protease), Taq DNA polymerase, and VioA.

16
Evaluating FoldX5.1 for MAVISp Stability Data Collection

Vliora, A.; Tiberti, M.; Papaleo, E.

2026-04-02 bioinformatics 10.64898/2026.03.31.715598 medRxiv
Top 0.1%
10.1%

MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found high overall agreement with a mean Pearson correlation of 0.933 and a mean Cohen coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.

17
Confidence Without Verification: Screening pLDDT Unreliability in AlphaFold2 Fold-Switching Predictions

Thacker, R.

2026-03-05 bioinformatics 10.64898/2026.02.19.706878 medRxiv
Top 0.1%
9.6%

AlphaFold2 (AF2) has transformed structural biology, yet its confidence metrics, particularly the predicted Local Distance Difference Test (pLDDT) and predicted Template Modelling score (pTM), systematically fail for fold-switching proteins, which adopt two or more distinct conformations from a single amino acid sequence. Multiple groups have called for alternative quality measures to address this limitation. Here, we present a confidence-integrity screening method derived from the SDI framework [Thacker, 2025a], originally developed to detect reasoning failures across artificial intelligence architectures. We apply this screen to 27,098 predictions across 18 experimentally validated fold-switching proteins from the Porter fold-switching benchmark [Sala et al., 2023; Lee et al., 2025], identifying a 33.6% high-confidence FalseVerify rate, defined as predictions where AF2 is simultaneously confident about its output and structurally committed to the dominant fold while failing to capture the known alternative conformation. FalseVerify severity is predictable: complete secondary structure refolding produces 80-97% FalseVerify rates, while local backbone rearrangements produce 0-2%. Per-residue analysis localizes confidence failures to specific fold-switching regions, with cross-cluster pLDDT variance serving as a structural fingerprint distinguishing reliable from unreliable predictions even when mean pLDDT values are indistinguishable. Applied blindly to 52 unvalidated E. coli fold-switching candidates from the CF-random proteome search, the taxonomy produces structured categories, not noise, with a perfect 50/50 directional split confirming zero population-level correlation between pLDDT and conformational accuracy. The blind screen also identifies a sixth category, INVERTED confidence, invisible in benchmark data, in which the alternative conformation is more confident than the dominant. Nine E. coli proteins with balanced confidence profiles are prioritized for experimental validation. These results fill a specific methodological gap identified by Schafer & Porter (2025) and provide a quantitative framework for triaging fold-switching predictions before experimental validation.

18
Sequence determinants of the hypomobility of intrinsically disordered proteins in SDS-PAGE

Garg, A.; Gielnik, M. B.; Kjaergaard, M.

2026-03-25 biophysics 10.64898/2026.03.24.714011 medRxiv
Top 0.1%
9.0%

Proteins with intrinsically disordered regions (IDRs) migrate at a higher apparent molecular weight in sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE), complicating their analysis and identification. Here, we investigate the sequence determinants of the hypomobility of IDRs using a series of synthetic low-complexity domains. We find that negative charge increases the apparent molecular weight, but neutral polar tracts also have abnormally slow migration. Positive charge and hydrophobic residues decrease the apparent molecular weight, although lysine residues show a biphasic effect with decreased migration at high fractional contents. Combinations of residues show that different sequence contributions to the apparent molecular weight are not additive. The results can be rationalized by the protein-decorated micelle model by considering both SDS binding and the compaction of protein-SDS complexes.

19
The turn less taken: Investigating patterns in β-turn dynamics using large-scale molecular dynamics data

Zhang, S.; Maddipatla, S. A.; Vedula, S.; Marx, A.; Bronstein, A. M.

2026-05-08 biochemistry 10.64898/2026.05.07.721674 medRxiv
Top 0.1%
8.9%

β-turns are among the most common structural motifs in proteins, yet their conformational dynamics and sequence determinants remain incompletely understood. Here we present a data-driven classification and dynamic analysis of β-turn conformations using large-scale molecular dynamics trajectories from the mdCATH database. Clustering of backbone dihedral angles using a cross-bond Ramachandran representation identifies six β-turn types, including a previously uncharacterized hybrid I/I' cluster that combines geometric features of canonical type I and I' conformations. Time-resolved analysis indicates that this hybrid state acts as a transient intermediate state of β-turns. Transitions observed in molecular dynamics simulations closely match NMR ensembles and altlocs detected in X-ray crystal structures, with the most dominant exchanges occurring between type I and II, and between type I' and II' turns. Sequence analysis shows that each turn type exhibits characteristic amino acid preferences at the central residues (i + 1 and i + 2). Within these overall preferences, specific residue pairs display distinct biases toward static or dynamic behavior. Targeted in silico substitutions that interchange dynamic- and static-enriched residue pairs shift the conformational behavior of turns accordingly, providing direct support for these sequence-dynamics relationships. Analysis of flanking secondary-structure environments reveals that structural context further modulates turn flexibility, with strand- and coil-associated turns exhibiting higher dynamic propensity than helix-associated turns. Together, these results reveal how sequence composition and structural context jointly shape the conformational landscape of β-turns.

20
Structure-derived synthetic sequences guide a protein language model toward metalloproteins

Peteani, G.; Sgueglia, G.; Lemmin, T.; Chino, M.

2026-05-05 bioinformatics 10.64898/2026.04.30.722007 medRxiv
Top 0.1%
8.8%

Motivation: Protein language models (pLMs) capture evolutionary sequence constraints but are limited in modeling underrepresented functional classes due to training data imbalance. Metalloproteins constitute a fundamental but sparsely represented class in sequence databases. We therefore assess whether structure-conditioned synthetic sequences can be used to specialize pLMs toward metal-binding functionality. Results: We fine-tuned the generalist model ProtGPT2 on synthetic sequences generated by the inverse-folding model ProteinMPNN, constructing training sets with controlled variation in size and diversity. Fine-tuning increased recovery of canonical metal-binding motifs from 43% in the baseline model to 91% in the fine-tuned models. Generated sequences retained high predicted structural confidence and structural similarity to known folds, despite low sequence identity. Analysis of latent representations from ProtGPT2 indicated that fine-tuned models occupy distinct regions of embedding space relative to both the baseline model and structure-conditioned sequences, consistent with partial incorporation of structural constraints while preserving sequence diversity. A multi-step filtering pipeline applied to sequences lacking canonical motifs identified candidate metal-binding sites in four-helical bundle topologies not detected in a non-redundant subset of Protein Data Bank structures or in AlphaFold-predicted proteomes. Availability and implementation: Code, trained models, and datasets are available at: https://doi.org/10.5281/zenodo.18672158 and https://huggingface.co/gsgueglia.